multi-modal transformer
- North America > United States > New York > Monroe County > Rochester (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Asia > China > Anhui Province > Hefei (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
Evaluating and Explaining Earthquake-Induced Liquefaction Potential through Multi-Modal Transformers
Youwai, Sompote, Kitkobsin, Tipok, Leelataviwat, Sutat, Jongpradist, Pornkasem
This study presents an explainable parallel transformer architecture for soil liquefaction prediction that integrates three distinct data streams: spectral seismic encoding, soil stratigraphy tokenization, and site-specific features. The architecture processes data from 165 case histories across 11 major earthquakes, employing Fast Fourier Transform for seismic waveform encoding and principles from large language models for soil layer tokenization. Interpretability is achieved through SHapley Additive exPlanations (SHAP), which decompose predictions into individual contributions from seismic characteristics, soil properties, and site conditions. The model achieves 93.75% prediction accuracy on cross-regional validation sets and demonstrates robust performance through sensitivity analysis of ground motion intensity and soil resistance parameters. Notably, validation against previously unseen ground motion data from the 2024 Noto Peninsula earthquake confirms the model's generalization capabilities and practical utility. Implementation as a publicly accessible web application enables rapid assessment of multiple sites simultaneously. This approach establishes a new framework in geotechnical deep learning where sophisticated multi-modal analysis meets practical engineering requirements through quantitative interpretation and accessible deployment.
- North America (0.68)
- Asia > Japan > Honshū > Chūbu (0.46)
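The spectral encoding step described in the abstract above is concrete enough to sketch. Below is a minimal, illustrative take on FFT-based encoding of a seismic record into a fixed-length feature token; the function name, bin count, and pooling scheme are assumptions, not the authors' implementation.

```python
# A minimal sketch of FFT-based spectral encoding for a seismic waveform,
# assuming the goal is one fixed-length spectral token per record; all names
# and parameters are illustrative.
import numpy as np

def spectral_encode(waveform: np.ndarray, n_bins: int = 64) -> np.ndarray:
    """Encode a 1-D waveform as a log-magnitude spectrum pooled into n_bins."""
    spectrum = np.abs(np.fft.rfft(waveform))  # one-sided magnitude spectrum
    log_spec = np.log1p(spectrum)             # compress dynamic range
    # Average-pool the spectrum into a fixed number of frequency bins so that
    # records of different lengths map to tokens of the same dimension.
    edges = np.linspace(0, len(log_spec), n_bins + 1, dtype=int)
    return np.array([log_spec[a:b].mean() for a, b in zip(edges[:-1], edges[1:])])

# Example: a synthetic 10 s record sampled at 100 Hz becomes a 64-d token.
rng = np.random.default_rng(0)
token = spectral_encode(rng.standard_normal(1000))
print(token.shape)  # (64,)
```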
Multi-Modal Transformer and Reinforcement Learning-based Beam Management
Ghassemi, Mohammad, Zhang, Han, Afana, Ali, Sediq, Akram Bin, Erol-Kantarci, Melike
Beam management is an important technique to improve signal strength and reduce interference in wireless communication systems. Recently, there has been increasing interest in using diverse sensing modalities for beam management. However, it remains a significant challenge to process multi-modal data efficiently and extract useful information. The recently emerging multi-modal transformer (MMT) is a promising technique that can process multi-modal data by capturing long-range dependencies. While MMT is highly effective in handling multi-modal data and providing robust beam management, integrating reinforcement learning (RL) further enhances its adaptability in dynamic environments. In this work, we propose a two-step beam management method that combines MMT with RL for dynamic beam index prediction. In the first step, we divide the available beam indices into several groups and leverage MMT to process diverse data modalities and predict the optimal beam group. In the second step, we employ RL for fast beam decision-making within each group, which in turn maximizes throughput. Our proposed framework is tested on a 6G dataset, where it achieves higher beam prediction accuracy and system throughput than both the MMT-only and the RL-only methods.
- North America > Canada > Ontario > National Capital Region > Ottawa (0.04)
- Asia > China (0.04)
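The two-step structure in the abstract maps naturally onto a coarse group classifier plus a per-group bandit. The sketch below is illustrative only: a placeholder stands in for the multi-modal transformer, and a simple epsilon-greedy bandit stands in for the RL agent.

```python
# A minimal sketch of the two-step idea: a (stand-in) multi-modal model picks a
# beam group, then an epsilon-greedy bandit refines the beam inside that group.
# All names and constants are illustrative assumptions.
import random

N_GROUPS, BEAMS_PER_GROUP = 4, 8

def mmt_predict_group(sensor_features) -> int:
    """Placeholder for the multi-modal transformer's group prediction."""
    return hash(tuple(sensor_features)) % N_GROUPS

class EpsilonGreedyBeam:
    """Per-group bandit that learns which beam index maximizes throughput."""
    def __init__(self, n_beams: int, eps: float = 0.1):
        self.q = [0.0] * n_beams   # running throughput estimates
        self.n = [0] * n_beams
        self.eps = eps

    def select(self) -> int:
        if random.random() < self.eps:
            return random.randrange(len(self.q))                 # explore
        return max(range(len(self.q)), key=self.q.__getitem__)   # exploit

    def update(self, beam: int, throughput: float):
        self.n[beam] += 1
        self.q[beam] += (throughput - self.q[beam]) / self.n[beam]

bandits = [EpsilonGreedyBeam(BEAMS_PER_GROUP) for _ in range(N_GROUPS)]
group = mmt_predict_group([0.3, 0.7, 0.1])   # step 1: coarse group choice
beam = bandits[group].select()               # step 2: fine beam choice
bandits[group].update(beam, throughput=random.random())  # simulated feedback
```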
Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification
Liu, Tengfei, Hu, Yongli, Gao, Junbin, Sun, Yanfeng, Yin, Baocai
Long Document Classification (LDC) has gained significant attention recently. However, multi-modal data in long documents, such as texts and images, are not being effectively utilized. Prior studies in this area have attempted to integrate texts and images in document-related tasks, but they have only focused on short text sequences and images of pages. How to classify long documents with hierarchically structured text and embedded images is a new problem that poses multi-modal representation challenges. In this paper, we propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification. The HMT conducts multi-modal feature interaction and fusion between images and texts in a hierarchical manner. Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships between image features and the section and sentence features. Furthermore, we introduce a new interaction strategy, the dynamic mask transfer module, to integrate these two transformers by propagating features between them. To validate our approach, we conduct cross-modal LDC experiments on two newly created and two publicly available multi-modal long document datasets; the results show that the proposed HMT outperforms state-of-the-art single-modality and multi-modality methods.
- Research Report > New Finding (0.48)
- Research Report > Promising Solution (0.34)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.92)
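The hierarchical fusion described above can be sketched as stacked cross-attention: sentence tokens attend to image tokens, then section tokens attend to the fused sentence tokens. A minimal PyTorch sketch, with all dimensions and module names assumed rather than taken from the paper:

```python
# A minimal sketch (not the authors' code) of hierarchical cross-modal fusion.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, context):
        fused, _ = self.attn(queries, context, context)  # queries attend to context
        return self.norm(queries + fused)                # residual + norm

sent_fuse = CrossModalBlock()   # sentence-level fusion with images
sect_fuse = CrossModalBlock()   # section-level fusion with fused sentences

imgs = torch.randn(1, 10, 256)      # 10 image tokens
sents = torch.randn(1, 120, 256)    # 120 sentence tokens
sections = torch.randn(1, 8, 256)   # 8 section tokens

sents = sent_fuse(sents, imgs)          # lower level: sentences <- images
sections = sect_fuse(sections, sents)   # higher level: sections <- sentences
```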
Towards Multi-modal Transformers in Federated Learning
Sun, Guangyu, Mendieta, Matias, Dutta, Aritra, Li, Xin, Chen, Chen
Multi-modal transformers mark significant progress in different domains, but siloed high-quality data hinders their further improvement. To remedy this, federated learning (FL) has emerged as a promising privacy-preserving paradigm for training models without direct access to the raw data held by different clients. Despite its potential, a considerable research direction remains unexplored: FL with unpaired uni-modal clients and transformer architectures. To fill this gap, this paper explores a transfer multi-modal federated learning (MFL) scenario within the vision-language domain, where clients possess data of various modalities distributed across different datasets. We systematically evaluate the performance of existing methods when a transformer architecture is utilized and introduce a novel framework, Federated Modality Complementary and Collaboration (FedCola), that addresses the in-modality and cross-modality gaps among clients. Through extensive experiments across various FL settings, FedCola demonstrates superior performance over previous approaches, offering new perspectives on the future federated training of multi-modal transformers.
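The setting itself, uni-modal clients sharing a transformer trunk while keeping modality-specific embedders local, is easy to sketch. The code below illustrates that setup with plain federated averaging of the shared trunk; it is not FedCola's method, which additionally addresses the in-modality and cross-modality gaps.

```python
# A minimal sketch of the transfer-MFL setting: uni-modal clients share a
# transformer trunk and keep modality-specific embedders local, so only trunk
# weights are averaged. Names and dimensions are illustrative assumptions.
import copy
import torch
import torch.nn as nn

def make_client(is_vision: bool, trunk: nn.Module) -> nn.ModuleDict:
    # Modality-specific embedder: never leaves the client in this sketch.
    embed = nn.Linear(768 if is_vision else 512, 256)
    return nn.ModuleDict({"embed": embed, "trunk": copy.deepcopy(trunk)})

def fedavg_trunks(clients) -> dict:
    """Average only the shared trunk parameters across clients."""
    keys = clients[0]["trunk"].state_dict().keys()
    return {k: torch.stack([c["trunk"].state_dict()[k] for c in clients]).mean(0)
            for k in keys}

trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(256, 4, batch_first=True), num_layers=2)
clients = [make_client(True, trunk), make_client(False, trunk)]  # vision + text
avg = fedavg_trunks(clients)
for c in clients:                  # broadcast the averaged trunk back
    c["trunk"].load_state_dict(avg)
```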
Leveraging Transformers to Improve Breast Cancer Classification and Risk Assessment with Multi-modal and Longitudinal Data
Shen, Yiqiu, Park, Jungkyu, Yeung, Frank, Goldberg, Eliana, Heacock, Laura, Shamout, Farah, Geras, Krzysztof J.
Breast cancer screening, primarily conducted through mammography, is often supplemented with ultrasound for women with dense breast tissue. However, existing deep learning models analyze each modality independently, missing opportunities to integrate information across imaging modalities and time. In this study, we present Multi-modal Transformer (MMT), a neural network that utilizes mammography and ultrasound synergistically, to identify patients who currently have cancer and estimate the risk of future cancer for patients who are currently cancer-free. MMT aggregates multi-modal data through self-attention and tracks temporal tissue changes by comparing current exams to prior imaging. Trained on 1.3 million exams, MMT achieves an AUROC of 0.943 in detecting existing cancers, surpassing strong uni-modal baselines. For 5-year risk prediction, MMT attains an AUROC of 0.826, outperforming prior mammography-based risk models. Our research highlights the value of multi-modal and longitudinal imaging in cancer diagnosis and risk stratification.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > New York > New York County > New York City (0.04)
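A hedged sketch of the aggregation idea above: image-level feature tokens from current mammography, current ultrasound, and prior exams are tagged with learned type embeddings and pooled by self-attention. All names and dimensions are assumptions, not the authors' architecture.

```python
# A minimal sketch of multi-modal, longitudinal aggregation in the spirit of
# the abstract; illustrative only.
import torch
import torch.nn as nn

class ExamAggregator(nn.Module):
    def __init__(self, dim: int = 256, n_types: int = 3):
        super().__init__()
        self.type_embed = nn.Embedding(n_types, dim)   # 0=mammo, 1=US, 2=prior
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)                  # cancer-probability logit

    def forward(self, tokens, token_types):
        x = tokens + self.type_embed(token_types)      # tag each token's source
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1)
        return self.head(self.encoder(x)[:, 0])        # classify from CLS token

model = ExamAggregator()
tokens = torch.randn(2, 6, 256)                 # 6 image-level feature tokens
types = torch.tensor([[0, 0, 1, 1, 2, 2]] * 2)  # mammo, US, prior exams
logits = model(tokens, types)                   # shape (2, 1)
```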
Cross Attention Transformers for Multi-modal Unsupervised Whole-Body PET Anomaly Detection
Patel, Ashay, Tudosiu, Petru-Daniel, Pinaya, Walter H. L., Cook, Gary, Goh, Vicky, Ourselin, Sebastien, Cardoso, M. Jorge
Cancer is a highly heterogeneous condition that can occur almost anywhere in the human body. 18F-fluorodeoxyglucose (FDG) PET is an imaging modality commonly used to detect cancer due to its high sensitivity and clear visualisation of the pattern of metabolic activity. Nonetheless, given this heterogeneity, it is challenging to train general-purpose discriminative cancer detection models, with data availability and disease complexity often cited as limiting factors. Unsupervised anomaly detection models have been suggested as a putative solution. These models learn a healthy representation of tissue and detect cancer by predicting deviations from the healthy norm, which requires models capable of accurately learning long-range interactions between organs and their imaging patterns with high levels of expressivity. Such characteristics are suitably satisfied by transformers, which have been shown to generate state-of-the-art results in unsupervised anomaly detection by training on normal data. This work expands upon such approaches by introducing multi-modal conditioning of the transformer via cross-attention, i.e., supplying anatomical reference from paired CT. Using 294 whole-body PET/CT samples, we show that our anomaly detection method is robust and capable of achieving accurate cancer localization results even in cases where normal training data is unavailable. In addition, we show the efficacy of this approach on out-of-sample data, showcasing its generalizability with limited training data. Lastly, we propose to combine model uncertainty with a new kernel density estimation approach, and show that it provides clinically and statistically significant improvements over classic residual-based anomaly maps. Overall, superior performance is demonstrated against leading state-of-the-art alternatives, drawing attention to the potential of these approaches.
- Europe > United Kingdom (0.28)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Europe > Netherlands > Drenthe > Assen (0.04)
- (3 more...)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
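The cross-attention conditioning can be sketched with a standard transformer decoder, where PET tokens query paired CT tokens for anatomical context and a residual map scores anomalies. This is an illustration of the mechanism, not the paper's generative model; token counts and dimensions are assumptions.

```python
# A minimal sketch of the conditioning idea: PET tokens cross-attend to paired
# CT tokens, anchoring the model's notion of "healthy" to anatomy.
import torch
import torch.nn as nn

layer = nn.TransformerDecoderLayer(d_model=256, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)

pet_tokens = torch.randn(1, 128, 256)   # tokenized PET volume (assumed 128 tokens)
ct_tokens = torch.randn(1, 128, 256)    # paired CT volume as conditioning memory

# Cross-attention enters through the decoder's memory input: each PET token
# queries the CT tokens for anatomical context before self-attending to PET.
healthy_recon = decoder(tgt=pet_tokens, memory=ct_tokens)

# A residual-style anomaly map: large deviations from the "healthy"
# reconstruction flag candidate lesions (the paper augments this with KDE).
anomaly_score = (pet_tokens - healthy_recon).abs().mean(-1)  # (1, 128)
```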
Top 5 Interview Questions on Multi-modal Transformers
Until recently, developing new, improved transformers for a single modality was common practice. To tackle real-world tasks, however, there was a pressing need for multi-modal transformer models: models that learn representations from different modalities, such as images, text, and audio, using a single model. Multi-modal learning models the combination of different modalities of data, which often arises in real-world applications.
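To make the definition concrete, here is a minimal sketch of "one model, several modalities": image patches and text tokens are projected into a shared embedding space and processed by a single transformer encoder. All dimensions and vocabulary sizes are illustrative.

```python
# A minimal sketch of joint multi-modal processing; illustrative only.
import torch
import torch.nn as nn

img_proj = nn.Linear(3 * 16 * 16, 256)   # flattened 16x16 RGB patches -> shared dim
txt_embed = nn.Embedding(30522, 256)     # text vocabulary -> shared dim
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(256, 4, batch_first=True), num_layers=2)

patches = torch.randn(1, 196, 3 * 16 * 16)    # 14x14 grid of image patches
text_ids = torch.randint(0, 30522, (1, 32))   # 32 text tokens

tokens = torch.cat([img_proj(patches), txt_embed(text_ids)], dim=1)
out = encoder(tokens)   # one model attends across both modalities jointly
```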
Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos
Crisol, Tomás, Ermantraut, Joel, Rostagno, Adrián, Aggio, Santiago L., Iparraguirre, Javier
In recent years, transformer architectures have grown in popularity. Modulated Detection Transformer (MDETR) is an end-to-end multi-modal understanding model that performs tasks such as phrase grounding, referring expression comprehension, referring expression segmentation, and visual question answering. One remarkable aspect of the model is its capacity to infer over classes it was not previously trained on. In this work we explore the use of MDETR in a new task, action detection, without any additional training. We obtain quantitative results using the Atomic Visual Actions dataset. Although the model does not achieve the best performance on the task, we believe this is an interesting finding: we show that it is possible to use a multi-modal model to tackle a task it was not designed for. Finally, we believe this line of research may lead to the generalization of MDETR to additional downstream tasks.
- South America > Argentina (0.05)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.05)
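The experiment's shape, querying a pretrained text-conditioned detector with an action phrase it was never trained on, can be sketched as below. The hub entry point and two-stage forward call follow the MDETR repository's demo; treat the exact names as assumptions and check the repo if they have changed. "frame.jpg" is a hypothetical video frame.

```python
# A hedged sketch of zero-shot action querying with a pretrained MDETR model.
import torch
from PIL import Image
import torchvision.transforms as T

# Entry point as documented in the ashkamath/mdetr repository (assumed current).
model, postprocessor = torch.hub.load(
    "ashkamath/mdetr:main", "mdetr_resnet101",
    pretrained=True, return_postprocessor=True)
model.eval()

transform = T.Compose([
    T.Resize(800), T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
img = transform(Image.open("frame.jpg").convert("RGB")).unsqueeze(0)  # hypothetical frame

caption = "a person standing up"  # the free-text action query; no action training
with torch.no_grad():
    # Two-stage forward pass, following the repository's demo notebook.
    memory_cache = model(img, [caption], encode_and_save=True)
    outputs = model(img, [caption], encode_and_save=False, memory_cache=memory_cache)

# Confidence that each predicted box matches (part of) the caption.
scores = 1 - outputs["pred_logits"].softmax(-1)[0, :, -1]
boxes = outputs["pred_boxes"][0, scores > 0.7]  # normalized cxcywh boxes
```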